label accuracy
Robustness of conditional GANs to noisy labels
Kiran K. Thekumparampil, Ashish Khetan, Zinan Lin, Sewoong Oh
We study the problem of learning conditional generators from noisy labeled samples, where the labels are corrupted by random noise. A standard training of conditional GANs will not only produce samples with wrong labels, but also generate poor quality samples. We consider two scenarios, depending on whether the noise model is known or not. When the distribution of the noise is known, we introduce a novel architecture which we call Robust Conditional GAN (RCGAN). The main idea is to corrupt the label of the generated sample before feeding to the adversarial discriminator, forcing the generator to produce samples with clean labels. This approach of passing through a matching noisy channel is justified by accompanying multiplicative approximation bounds between the loss of the RCGAN and the distance between the clean real distribution and the generator distribution. This shows that the proposed approach is robust, when used with a carefully chosen discriminator architecture, known as projection discriminator. When the distribution of the noise is not known, we provide an extension of our architecture, which we call RCGAN-U, that learns the noise model simultaneously while training the generator. We show experimentally on MNIST and CIFAR-10 datasets that both the approaches consistently improve upon baseline approaches, and RCGAN-U closely matches the performance of RCGAN.
- North America > United States > Illinois (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > Canada > Quebec > Montreal (0.04)
Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning
He, Qi, Qian, Cheng, Chen, Xiusi, He, Bingxiang, Fung, Yi R., Ji, Heng
Claim verification with large language models (LLMs) has recently attracted growing attention, due to their strong reasoning capabilities and transparent verification processes compared to traditional answer-only judgments. However, existing approaches to online claim verification, which requires iterative evidence retrieval and reasoning, still mainly rely on prompt engineering or pre-designed reasoning workflows, without unified training to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. This dynamic interaction of LLM with retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles the evidence score, often surpassing its larger-scale model counterparts. Ablation studies further reveal the impact of reward components, and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification, providing an important foundation for future research. We release our code to support community progress in LLM empowered claim verification.
- North America > United States > West Virginia > Kanawha County > Charleston (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Asia > China > Hong Kong (0.04)
- Research Report > New Finding (0.86)
- Instructional Material > Online (0.61)
- Leisure & Entertainment > Sports (0.46)
- Law (0.46)
- Government (0.46)
An Empirical Study of Vulnerabilities in Python Packages and Their Detection
Quan, Haowei, Wang, Junjie, Li, Xinzhe, Zhuo, Terry Yue, Chen, Xiao, Du, Xiaoning
In the rapidly evolving software development landscape, Python stands out for its simplicity, versatility, and extensive ecosystem. Python packages, as units of organization, reusability, and distribution, have become a pressing concern, highlighted by the considerable number of vulnerability reports. As a scripting language, Python often cooperates with other languages for performance or interoperability. This adds complexity to the vulnerabilities inherent to Python packages, and the effectiveness of current vulnerability detection tools remains underexplored. This paper addresses these gaps by introducing PyVul, the first comprehensive benchmark suite of Python-package vulnerabilities. PyVul includes 1,157 publicly reported, developer-verified vulnerabilities, each linked to its affected packages. To accommodate diverse detection techniques, it provides annotations at both commit and function levels. An LLM-assisted data cleansing method is incorporated to improve label accuracy, achieving 100% commit-level and 94% function-level accuracy, establishing PyVul as the most precise large-scale Python vulnerability benchmark. We further carry out a distribution analysis of PyVul, which demonstrates that vulnerabilities in Python packages involve multiple programming languages and exhibit a wide variety of types. Moreover, our analysis reveals that multi-lingual Python packages are potentially more susceptible to vulnerabilities. Evaluation of state-of-the-art detectors using this benchmark reveals a significant discrepancy between the capabilities of existing tools and the demands of effectively identifying real-world security issues in Python packages. Additionally, we conduct an empirical review of the top-ranked CWEs observed in Python packages, to diagnose the fine-grained limitations of current detection tools and highlight the necessity for future advancements in the field.
- North America > United States > District of Columbia > Washington (0.05)
- Asia > China > Tianjin Province > Tianjin (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (2 more...)
Addressing Leakage in Concept Bottleneck Models
Concept bottleneck models (CBMs) (Chen et al., 2020; Koh et al., 2020; Y eh et al., 2020; Lage & Doshi-V elez, 2020; Wang et al., 2017) propose explicitly aligning the intermediate layers of a The concept bottleneck model (CBM) makes its prediction in two stages. While soft concepts may improve predictive performance, this improvement comes at a cost.
The Accuracy Cost of Weakness: A Theoretical Analysis of Fixed-Segment Weak Labeling for Events in Time
Martinsson, John, Mogren, Olof, Virtanen, Tuomas, Sandsten, Maria
Accurate labels are critical for deriving robust machine learning models. Labels are used to train supervised learning models and to evaluate most machine learning paradigms. In this paper, we model the accuracy and cost of a common weak labeling process where annotators assign presence or absence labels to fixed-length data segments for a given event class. The annotator labels a segment as "present" if it sufficiently covers an event from that class, e.g., a birdsong sound event in audio data. We analyze how the segment length affects the label accuracy and the required number of annotations, and compare this fixed-length labeling approach with an oracle method that uses the true event activations to construct the segments. Furthermore, we quantify the gap between these methods and verify that in most realistic scenarios the oracle method is better than the fixed-length labeling method in both accuracy and cost. Our findings provide a theoretical justification for adaptive weak labeling strategies that mimic the oracle process, and a foundation for optimizing weak labeling processes in sequence labeling tasks.
- Europe > Sweden (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (4 more...)
A jury evaluation theorem
Majority voting (MV) is the prototypical ``wisdom of the crowd'' algorithm. Theorems considering when MV is optimal for group decisions date back to Condorcet's 1785 jury decision theorem. The same assumption of error independence used by Condorcet is used here to prove a jury evaluation theorem that does purely algebraic evaluation (AE). Three or more binary jurors are enough to obtain the only two possible statistics of their correctness on a joint test they took. AE is shown to be superior to MV since it allows one to choose the minority vote depending on how the jurors agree or disagree. In addition, AE is self-alarming about the failure of the error-independence assumption. Experiments labeling demographic datasets from the American Community Survey are carried out to compare MV and AE on nearly error-independent ensembles. In general, using algebraic evaluation leads to better classifier evaluations and group labeling decisions.
- Law (0.68)
- Government > Regional Government > North America Government > United States Government (0.46)
Enhancing SLM via ChatGPT and Dataset Augmentation
Pieper, Tom, Ballout, Mohamad, Krumnack, Ulf, Heidemann, Gunther, Kühnberger, Kai-Uwe
This paper explores the enhancement of small language models through strategic dataset augmentation via ChatGPT-3.5-Turbo, in the domain of Natural Language Inference (NLI). By employing knowledge distillation-based techniques and synthetic dataset augmentation, we aim to bridge the performance gap between large language models (LLMs) and small language models (SLMs) without the immense cost of human annotation. Our methods involve two forms of rationale generation--information extraction and informed reasoning--to enrich the ANLI dataset. We then fine-tune T5-Small on these augmented datasets, evaluating its performance against an established benchmark. Our findings reveal that the incorporation of synthetic rationales significantly improves the model's ability to comprehend natural language, leading to 1.3\% and 2.3\% higher classification accuracy, respectively, on the ANLI dataset, demonstrating the potential of leveraging LLMs for dataset augmentation. This approach not only enhances the performance of smaller models on complex tasks but also introduces a cost-effective method for fine-tuning smaller language models. By advancing our understanding of knowledge distillation and fine-tuning strategies, this work contributes to the ongoing effort to create more capable and efficient NLP systems.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- (2 more...)
Ensure Timeliness and Accuracy: A Novel Sliding Window Data Stream Paradigm for Live Streaming Recommendation
Liang, Fengqi, Zheng, Baigong, Zhao, Liqin, Zhou, Guorui, Wang, Qian, Niu, Yanan
Live streaming recommender system is specifically designed to recommend real-time live streaming of interest to users. Due to the dynamic changes of live content, improving the timeliness of the live streaming recommender system is a critical problem. Intuitively, the timeliness of the data determines the upper bound of the timeliness that models can learn. However, none of the previous works addresses the timeliness problem of the live streaming recommender system from the perspective of data stream design. Employing the conventional fixed window data stream paradigm introduces a trade-off dilemma between labeling accuracy and timeliness. In this paper, we propose a new data stream design paradigm, dubbed Sliver, that addresses the timeliness and accuracy problem of labels by reducing the window size and implementing a sliding window correspondingly. Meanwhile, we propose a time-sensitive re-reco strategy reducing the latency between request and impression to improve the timeliness of the recommendation service and features by periodically requesting the recommendation service. To demonstrate the effectiveness of our approach, we conduct offline experiments on a multi-task live streaming dataset with labeling timestamps collected from the Kuaishou live streaming platform. Experimental results demonstrate that Sliver outperforms two fixed-window data streams with varying window sizes across all targets in four typical multi-task recommendation models. Furthermore, we deployed Sliver on the Kuaishou live streaming platform. Results of the online A/B test show a significant improvement in click-through rate (CTR), and new follow number (NFN), further validating the effectiveness of Sliver.
- Asia > China > Beijing > Beijing (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > Middle East > Jordan (0.04)
Freshness or Accuracy, Why Not Both? Addressing Delayed Feedback via Dynamic Graph Neural Networks
Zheng, Xiaolin, Wang, Zhongyu, Chen, Chaochao, Zhu, Feng, Qian, Jiashu
The delayed feedback problem is one of the most pressing challenges in predicting the conversion rate since users' conversions are always delayed in online commercial systems. Although new data are beneficial for continuous training, without complete feedback information, i.e., conversion labels, training algorithms may suffer from overwhelming fake negatives. Existing methods tend to use multitask learning or design data pipelines to solve the delayed feedback problem. However, these methods have a trade-off between data freshness and label accuracy. In this paper, we propose Delayed Feedback Modeling by Dynamic Graph Neural Network (DGDFEM). It includes three stages, i.e., preparing a data pipeline, building a dynamic graph, and training a CVR prediction model. In the model training, we propose a novel graph convolutional method named HLGCN, which leverages both high-pass and low-pass filters to deal with conversion and non-conversion relationships. The proposed method achieves both data freshness and label accuracy. We conduct extensive experiments on three industry datasets, which validate the consistent superiority of our method.
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- North America > United States > New York > New York County > New York City (0.04)
Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions
Chung, John Joon Young, Kamar, Ece, Amershi, Saleema
Large language models (LLMs) can be used to generate text data for training and evaluating other models. However, creating high-quality datasets with LLMs can be challenging. In this work, we explore human-AI partnerships to facilitate high diversity and accuracy in LLM-based text data generation. We first examine two approaches to diversify text generation: 1) logit suppression, which minimizes the generation of languages that have already been frequently generated, and 2) temperature sampling, which flattens the token sampling probability. We found that diversification approaches can increase data diversity but often at the cost of data accuracy (i.e., text and labels being appropriate for the target domain). To address this issue, we examined two human interventions, 1) label replacement (LR), correcting misaligned labels, and 2) out-of-scope filtering (OOSF), removing instances that are out of the user's domain of interest or to which no considered label applies. With oracle studies, we found that LR increases the absolute accuracy of models trained with diversified datasets by 14.4%. Moreover, we found that some models trained with data generated with LR interventions outperformed LLM-based few-shot classification. In contrast, OOSF was not effective in increasing model accuracy, implying the need for future work in human-in-the-loop text data generation.
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- (12 more...)
- Media > Film (0.46)
- Leisure & Entertainment (0.46)